## Author: Kiril Boyanov (kirilboyanov [at] gmail.com)
## LinkedIn: www.linkedin.com/kirilboyanov/
## Last update: 2023-12-08
In this file, we explore the correlations between happiness and a series of economic, political, societal, environmental and health-related factors. We perform this investigation both by using the most recent data and by looking at historical data so as to see whether the correlations change across time. Additionally, we explore within-country correlations to see whether the strongly correlated factors are mostly the same or different for each country.
Importing relevant packages, defining custom functions, specifying local folders etc.
# Importing relevant packages
# For general data-related tasks
library(plyr)
library(tidyverse)
library(data.table)
library(openxlsx)
library(readxl)
library(arrow)
# For dealing with missing values
library(mice)
# For working with countries
library(countrycode)
# For data visualization
library(ggplot2)
library(plotly)
Throughout the analysis, we will be using a common
BaseYear (to represent the past state of happiness) and a
common ReferenceYear (to represent the most recent state of
happiness). To ensure consistency across files, these two years are
stored in a TXT file, which is imported below.
Thus, we use the following years as base and reference:
## Base year: 2005
## Reference year: 2022
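A minimal sketch of how the two years could be read from a text file follows. The file name and its `Name: value` layout are assumptions made for illustration; the actual TXT file used in the project may be structured differently.

```r
# Hypothetical file layout: one "Name: value" pair per line.
# A temporary file stands in for the real TXT file here.
years_file <- tempfile(fileext = ".txt")
writeLines(c("BaseYear: 2005", "ReferenceYear: 2022"), years_file)

# Split each line at the colon and build a named integer vector
raw_lines <- readLines(years_file)
parts <- strsplit(raw_lines, ":", fixed = TRUE)
years <- vapply(parts, function(p) as.integer(trimws(p[2])), integer(1))
names(years) <- vapply(parts, function(p) trimws(p[1]), character(1))

BaseYear <- years[["BaseYear"]]
ReferenceYear <- years[["ReferenceYear"]]

cat("Base year:", BaseYear, "\n")
cat("Reference year:", ReferenceYear, "\n")
```

Keeping the years in one shared file means every notebook reads the same values instead of hard-coding them.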
We import data that was already pre-processed in the
WHR_data_prep.Rmd notebook. We draw on many different
data sources, previews of which are shown in the following
sub-sections.
| Country | CountryCode | Year | HappinessScore | RowID | CountryRank | Continent | Region |
|---|---|---|---|---|---|---|---|
| Finland | FIN | 2022 | 7.8210 | FIN_2022 | 1 | Europe | Europe & Central Asia |
| Denmark | DNK | 2022 | 7.6362 | DNK_2022 | 2 | Europe | Europe & Central Asia |
| Iceland | ISL | 2022 | 7.5575 | ISL_2022 | 3 | Europe | Europe & Central Asia |
| Switzerland | CHE | 2022 | 7.5116 | CHE_2022 | 4 | Europe | Europe & Central Asia |
| Netherlands | NLD | 2022 | 7.4149 | NLD_2022 | 5 | Europe | Europe & Central Asia |
Note that we have several similar measures, e.g. GDP in constant vs. current prices. While they convey essentially the same information, it’s too early to remove them: we first need to see which ones are the strongest correlates before making this decision.
| RowID | CountryISO3 | Country | Year | P_ControlOfCorruption | P_PoliticalStability | P_RuleOfLaw | P_VoiceAndAccountability | P_GovernmentEffectiveness | P_CorruptionPerceptionIndex | P_ElectoralDemocracyIndex | P_FreedomOfExpression | P_FreedomOfAssociation | P_PopulationPctWithSuffrage | P_CleanElectionsIndex | P_ElectedOfficialsIndex | P_ConflictsFatalityPerCountry | S_TotalPopulation | S_PopulationAged14OrLess | S_PopulationAged15To64 | S_PopulationAged65OrMore | S_NetMigration | S_UrbanPopulation | S_UrbanPopPctOfTotal | S_TotalHomicideRate | S_FemaleHomicideRate | S_MaleHomicideRate | S_FemaleBankingAccessPctOfPop | S_FemaleSchoolDropoutRate | S_FemaleManagementPctOfTotal | S_LaborParticipRateFemale | S_LaborForcePctFemale | S_FemaleLiteracyRate | S_AccessToCleanFuelsPctOfTotal | S_AccessToCleanFuelsPctOfRural | S_AccessToCleanFuelsPctOfUrban | S_AccessToElectricityPctOfTotal | S_AccessToElectricityPctOfRural | S_AccessToElectricityPctOfUrban | S_CompulsoryEducationYears | S_PreprimaryEducationYears | S_PrimaryEducationYears | S_SecondaryEducationYears | S_BScAttainedPopAged25OrMore | S_LowerSecEduPctOfTotal | S_PrimaryEduPopAged25OrMore | S_UpperSecEduPctOfTotal | S_MScAttainedPopAged25OrMore | E_GDPPerCapitaConstant | E_GDPPerCapitaCurrent | E_GiniIndex | E_PovertyGap215PctOfPop | E_PovertyGap365PctOfPop | E_PovertyGap685PctOfPop | E_PovertyGap215Headcount | E_PovertyGap365Headcount | E_PovertyGap685Headcount | E_PovertyHeadcountPctOfPop | E_PovertyHeadcountPctOfAged17OrLess | E_ChildPovertyIndex | E_PovertyHeadcountPctOfHouseholds | E_TotalPovertyIndex | E_LaborTaxPctOfProfits | E_ProfitTaxPctOfProfits | E_TaxOnGoodsServicecPctOfRevenue | E_TaxOnGoodsServicecPctOfValAdded | E_TotalTaxPctOfProfits | E_ImportsPctOfGDP | E_ExportsPctOfGDP | E_ConsumerPriceInflation | E_FemaleUnemployment | E_MaleUnemployment | E_TotalUnemployment | E_TotalUnemploymentLocalEst | E_YouthUnemployment | E_YouthUnemploymentLocalEst | E_HealthExpenditurePctOfGDP | E_HealthExpenditurePerCapita | E_EducationExpenditurePctOfGDP | E_EduExpPerStudentPrimaryPctOfGDP | E_EduExpPerStudentSecondPctOfGDP | E_EduExpPerStudentTertiaryPctOfGDP | E_MilitaryExpenditurePctOfGDP | E_ResearchAndDevExpPctOfGDP | V_CO2EmissionsKgPerConstantGDP | V_CO2EmissionsKgPerGDP | V_CO2EmissionsKt | V_CO2PerCapita | V_AgriculturalLandPctOfTotal | V_ArableLandPctOfTotal | V_FertilizerUseKgPerHectare | V_ForestAreaPctOfTotal | V_TotalLandArea | V_ProtectedLandAreasPctOfTotal | V_TotalProtectedAreas | H_TotalSuicideRate | H_FemaleSuicideRate | H_MaleSuicideRate | H_AirPollutionMeanExpPctOfPop | H_AirPollutionOverExpPctOfPop | H_AccessToEssentialDrugs | H_InfantMortalityRate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AFG_1960 | AFG | Afghanistan | 1960 | NA | NA | NA | NA | NA | NA | 0.080 | 0.156 | 0.106 | 0.5 | 0.111 | 0 | NA | 8622466 | 3589290 | 4788899 | 244277 | 2606 | 724373 | 8.401 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 7.024793 | 4.132233 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| AFG_1961 | AFG | Afghanistan | 1961 | NA | NA | NA | NA | NA | NA | 0.083 | 0.165 | 0.111 | 0.5 | 0.112 | 0 | NA | 8790140 | 3665076 | 4877387 | 247678 | 6109 | 763336 | 8.684 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.097166 | 4.453443 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 57.87836 | 11.72899 | 0.1437908 | NA | 652230 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| AFG_1962 | AFG | Afghanistan | 1962 | NA | NA | NA | NA | NA | NA | 0.082 | 0.165 | 0.110 | 0.5 | 0.112 | 0 | NA | 8969047 | 3746296 | 4971702 | 251049 | 7016 | 805062 | 8.976 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.349593 | 4.878051 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 57.95502 | 11.80565 | 0.1428571 | NA | 652230 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| AFG_1963 | AFG | Afghanistan | 1963 | NA | NA | NA | NA | NA | NA | 0.085 | 0.172 | 0.134 | 0.5 | 0.105 | 0 | NA | 9157465 | 3835639 | 5067343 | 254483 | 6681 | 849446 | 9.276 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 16.863910 | 9.171601 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 58.03168 | 11.88231 | 0.1419355 | NA | 652230 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| AFG_1964 | AFG | Afghanistan | 1964 | NA | NA | NA | NA | NA | NA | 0.137 | 0.241 | 0.210 | 1.0 | 0.130 | 0 | NA | 9355514 | 3934872 | 5162530 | 258112 | 7079 | 896820 | 9.586 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 18.055555 | 8.888893 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 58.11600 | 11.95897 | 0.1410256 | NA | 652230 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
The data here was already put together in the
WHR_data_prep.Rmd notebook, so we merely need to
merge it with the happiness data. It is important to
take note of variables that have too many missing
values, as this might impact the overall data quality and make
some types of analysis unfeasible.
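The merge could look roughly like the sketch below. It assumes that `RowID` is the shared key (both preview tables carry that column) and uses base R's `merge()`. The indicator values are made-up toy numbers, and only the 2022 happiness scores are taken from the preview above.

```r
# Happiness scores (values from the preview table above)
happiness <- data.frame(
  RowID = c("FIN_2022", "DNK_2022", "ISL_2022"),
  Country = c("Finland", "Denmark", "Iceland"),
  Year = 2022L,
  HappinessScore = c(7.8210, 7.6362, 7.5575)
)

# Indicator data (made-up toy values; Iceland deliberately absent)
indicators <- data.frame(
  RowID = c("FIN_2022", "DNK_2022"),
  E_GiniIndex = c(27.0, 28.0),
  S_UrbanPopPctOfTotal = c(85.0, NA)
)

# Left join: keep every happiness row, even without matching indicators
merged <- merge(happiness, indicators, by = "RowID", all.x = TRUE)
print(merged)
```

A left join keeps countries that lack indicator data, so their gaps show up as `NA` and are counted in the missing-value overview rather than silently disappearing.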
The table below shows all available indicators, their respective categories as well as the percentage of missing values in each column:
| Area | Indicator | PctMissing_AllTime | PctMissing_BaseYear | PctMissing_RefYear |
|---|---|---|---|---|
| Political | P_ControlOfCorruption | 0.0 | 0.0 | 0.0 |
| Political | P_PoliticalStability | 0.0 | 0.0 | 0.0 |
| Political | P_RuleOfLaw | 0.0 | 0.0 | 0.0 |
| Political | P_VoiceAndAccountability | 0.0 | 0.0 | 0.0 |
| Political | P_GovernmentEffectiveness | 0.0 | 0.0 | 0.0 |
| Political | P_CorruptionPerceptionIndex | 0.0 | 3.7 | 0.0 |
| Political | P_ElectoralDemocracyIndex | 0.1 | 0.0 | 0.0 |
| Political | P_FreedomOfExpression | 0.1 | 0.0 | 0.0 |
| Political | P_FreedomOfAssociation | 0.1 | 0.0 | 0.0 |
| Political | P_PopulationPctWithSuffrage | 0.1 | 0.0 | 0.0 |
| Political | P_CleanElectionsIndex | 0.1 | 0.0 | 0.0 |
| Political | P_ElectedOfficialsIndex | 0.1 | 0.0 | 0.0 |
| Political | P_ConflictsFatalityPerCountry | 47.0 | 55.6 | 43.4 |
| Societal | S_TotalPopulation | 0.7 | 0.0 | 0.7 |
| Societal | S_PopulationAged14OrLess | 0.7 | 0.0 | 0.7 |
| Societal | S_PopulationAged15To64 | 0.7 | 0.0 | 0.7 |
| Societal | S_PopulationAged65OrMore | 0.7 | 0.0 | 0.7 |
| Societal | S_NetMigration | 0.7 | 0.0 | 0.7 |
| Societal | S_UrbanPopulation | 0.7 | 0.0 | 0.7 |
| Societal | S_UrbanPopPctOfTotal | 0.7 | 0.0 | 0.7 |
| Societal | S_TotalHomicideRate | 10.4 | 7.4 | 9.0 |
| Societal | S_FemaleHomicideRate | 31.0 | 22.2 | 25.5 |
| Societal | S_MaleHomicideRate | 30.9 | 22.2 | 26.2 |
| Societal | S_FemaleBankingAccessPctOfPop | 27.3 | 100.0 | 0.7 |
| Societal | S_FemaleSchoolDropoutRate | 13.5 | 25.9 | 4.8 |
| Societal | S_FemaleManagementPctOfTotal | 35.7 | 33.3 | 22.8 |
| Societal | S_LaborParticipRateFemale | 0.7 | 0.0 | 0.7 |
| Societal | S_LaborForcePctFemale | 0.7 | 0.0 | 0.7 |
| Societal | S_FemaleLiteracyRate | 16.1 | 44.4 | 15.9 |
| Societal | S_AccessToCleanFuelsPctOfTotal | 3.7 | 3.7 | 4.1 |
| Societal | S_AccessToCleanFuelsPctOfRural | 3.7 | 3.7 | 4.1 |
| Societal | S_AccessToCleanFuelsPctOfUrban | 3.7 | 3.7 | 4.1 |
| Societal | S_AccessToElectricityPctOfTotal | 0.7 | 0.0 | 0.7 |
| Societal | S_AccessToElectricityPctOfRural | 0.8 | 0.0 | 0.7 |
| Societal | S_AccessToElectricityPctOfUrban | 0.7 | 0.0 | 0.7 |
| Societal | S_CompulsoryEducationYears | 4.3 | 0.0 | 3.4 |
| Societal | S_PreprimaryEducationYears | 11.5 | 33.3 | 2.8 |
| Societal | S_PrimaryEducationYears | 0.7 | 0.0 | 0.7 |
| Societal | S_SecondaryEducationYears | 0.7 | 0.0 | 0.7 |
| Societal | S_BScAttainedPopAged25OrMore | 44.9 | 92.6 | 14.5 |
| Societal | S_LowerSecEduPctOfTotal | 5.1 | 11.1 | 2.8 |
| Societal | S_PrimaryEduPopAged25OrMore | 11.8 | 22.2 | 6.9 |
| Societal | S_UpperSecEduPctOfTotal | 7.3 | 18.5 | 4.1 |
| Societal | S_MScAttainedPopAged25OrMore | 58.8 | 100.0 | 30.3 |
| Economic | E_GDPPerCapitaConstant | 2.8 | 3.7 | 2.1 |
| Economic | E_GDPPerCapitaCurrent | 1.2 | 0.0 | 0.7 |
| Economic | E_GiniIndex | 8.3 | 14.8 | 6.9 |
| Economic | E_PovertyGap215PctOfPop | 8.3 | 14.8 | 6.9 |
| Economic | E_PovertyGap365PctOfPop | 8.3 | 14.8 | 6.9 |
| Economic | E_PovertyGap685PctOfPop | 8.3 | 14.8 | 6.9 |
| Economic | E_PovertyGap215Headcount | 8.3 | 14.8 | 6.9 |
| Economic | E_PovertyGap365Headcount | 8.3 | 14.8 | 6.9 |
| Economic | E_PovertyGap685Headcount | 8.3 | 14.8 | 6.9 |
| Economic | E_PovertyHeadcountPctOfPop | 69.9 | 100.0 | 57.9 |
| Economic | E_PovertyHeadcountPctOfAged17OrLess | 75.5 | 100.0 | 65.5 |
| Economic | E_ChildPovertyIndex | 97.7 | 100.0 | 94.5 |
| Economic | E_PovertyHeadcountPctOfHouseholds | 93.8 | 100.0 | 89.0 |
| Economic | E_TotalPovertyIndex | 90.2 | 100.0 | 82.1 |
| Economic | E_LaborTaxPctOfProfits | 4.9 | 14.8 | 1.4 |
| Economic | E_ProfitTaxPctOfProfits | 4.9 | 14.8 | 1.4 |
| Economic | E_TaxOnGoodsServicecPctOfRevenue | 15.0 | 11.1 | 11.7 |
| Economic | E_TaxOnGoodsServicecPctOfValAdded | 16.7 | 18.5 | 12.4 |
| Economic | E_TotalTaxPctOfProfits | 4.9 | 14.8 | 1.4 |
| Economic | E_ImportsPctOfGDP | 2.3 | 0.0 | 2.1 |
| Economic | E_ExportsPctOfGDP | 2.3 | 0.0 | 2.1 |
| Economic | E_ConsumerPriceInflation | 2.8 | 7.4 | 2.1 |
| Economic | E_FemaleUnemployment | 2.1 | 0.0 | 1.4 |
| Economic | E_MaleUnemployment | 0.7 | 0.0 | 0.7 |
| Economic | E_TotalUnemployment | 0.7 | 0.0 | 0.7 |
| Economic | E_TotalUnemploymentLocalEst | 1.3 | 0.0 | 0.7 |
| Economic | E_YouthUnemployment | 0.7 | 0.0 | 0.7 |
| Economic | E_YouthUnemploymentLocalEst | 6.5 | 7.4 | 1.4 |
| Economic | E_HealthExpenditurePctOfGDP | 2.5 | 0.0 | 2.1 |
| Economic | E_HealthExpenditurePerCapita | 2.5 | 0.0 | 2.1 |
| Economic | E_EducationExpenditurePctOfGDP | 3.1 | 0.0 | 2.1 |
| Economic | E_EduExpPerStudentPrimaryPctOfGDP | 18.9 | 37.0 | 12.4 |
| Economic | E_EduExpPerStudentSecondPctOfGDP | 20.5 | 33.3 | 13.1 |
| Economic | E_EduExpPerStudentTertiaryPctOfGDP | 16.1 | 18.5 | 10.3 |
| Economic | E_MilitaryExpenditurePctOfGDP | 3.7 | 0.0 | 4.1 |
| Economic | E_ResearchAndDevExpPctOfGDP | 15.7 | 3.7 | 12.4 |
| Environmental | V_CO2EmissionsKgPerConstantGDP | 4.0 | 3.7 | 3.4 |
| Environmental | V_CO2EmissionsKgPerGDP | 2.5 | 0.0 | 2.1 |
| Environmental | V_CO2EmissionsKt | 1.9 | 0.0 | 2.1 |
| Environmental | V_CO2PerCapita | 1.9 | 0.0 | 2.1 |
| Environmental | V_AgriculturalLandPctOfTotal | 0.7 | 0.0 | 0.7 |
| Environmental | V_ArableLandPctOfTotal | 0.7 | 0.0 | 0.7 |
| Environmental | V_FertilizerUseKgPerHectare | 0.7 | 0.0 | 0.7 |
| Environmental | V_ForestAreaPctOfTotal | 1.3 | 0.0 | 1.4 |
| Environmental | V_TotalLandArea | 0.7 | 0.0 | 0.7 |
| Environmental | V_ProtectedLandAreasPctOfTotal | 57.5 | 100.0 | 0.7 |
| Environmental | V_TotalProtectedAreas | 87.4 | 85.2 | 88.3 |
| Health-related | H_TotalSuicideRate | 1.9 | 0.0 | 2.1 |
| Health-related | H_FemaleSuicideRate | 1.9 | 0.0 | 2.1 |
| Health-related | H_MaleSuicideRate | 1.9 | 0.0 | 2.1 |
| Health-related | H_AirPollutionMeanExpPctOfPop | 1.3 | 0.0 | 1.4 |
| Health-related | H_AirPollutionOverExpPctOfPop | 1.3 | 0.0 | 1.4 |
| Health-related | H_AccessToEssentialDrugs | 99.0 | 100.0 | 99.3 |
| Health-related | H_InfantMortalityRate | 92.0 | 92.6 | 91.7 |
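A table like the one above can be produced with a few lines of base R. The sketch below is an illustration on made-up toy data, not the notebook's actual code; the helper `pct_missing()` is a hypothetical name.

```r
BaseYear <- 2005
ReferenceYear <- 2022

# Made-up toy data with two indicators
toy <- data.frame(
  Year = c(2005, 2005, 2010, 2022, 2022),
  P_RuleOfLaw = c(0.5, NA, 0.7, 0.9, 1.1),
  E_GiniIndex = c(NA, NA, 35, 30, NA)
)
indicator_cols <- setdiff(names(toy), "Year")

# Percentage of NA values per indicator column, rounded to one decimal
pct_missing <- function(df) {
  unname(round(100 * colMeans(is.na(df[indicator_cols])), 1))
}

missing_summary <- data.frame(
  Indicator = indicator_cols,
  PctMissing_AllTime = pct_missing(toy),
  PctMissing_BaseYear = pct_missing(toy[toy$Year == BaseYear, ]),
  PctMissing_RefYear = pct_missing(toy[toy$Year == ReferenceYear, ])
)
print(missing_summary)
```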
As the table of indicators above shows, some columns have a high share of missing data. Left untreated, these columns are of little use for analysis, so we need a strategy for dealing with missing data.
However, as the chart below shows, the availability of the data varies over time, so it may not be wise to apply the same strategy to every period:
Specifically, we can see that data is missing a lot more often in the
BaseYear than in the ReferenceYear, with the
entirety of the historical data lying in between. This indicates that
data quality has improved over time.
In this and the subsequent section, we work with the following three datasets:

- DataForAnalysis, containing the entirety of the historical data
- DataForBaseYear, containing only data from the base year we’ve selected to denote the past state of happiness
- DataForReferenceYear, containing only data from the reference year we’ve selected to denote the current state of happiness
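As a rough illustration, the two year-specific datasets can be derived from the full history by filtering on `Year`. The rows below are made-up toy data; only the dataset names and the two years come from the text.

```r
BaseYear <- 2005
ReferenceYear <- 2022

# Made-up toy history for two hypothetical countries
DataForAnalysis <- data.frame(
  CountryCode = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  Year = c(2005, 2013, 2022, 2005, 2022),
  HappinessScore = c(6.1, 6.4, 6.8, 4.9, 5.2)
)

# Year-specific subsets of the full history
DataForBaseYear <- DataForAnalysis[DataForAnalysis$Year == BaseYear, ]
DataForReferenceYear <- DataForAnalysis[DataForAnalysis$Year == ReferenceYear, ]

print(nrow(DataForBaseYear))
print(nrow(DataForReferenceYear))
```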
As a rule, we entirely remove columns that contain more than 10% missing data. A summary of the number of columns before and after removing the problematic variables is shown below:
| Dataset | ColumnsBefore | ColumnsAfter | ColumnsDropped | PctDropped |
|---|---|---|---|---|
| All time | 106 | 79 | 27 | 25.5 |
| Base year | 106 | 69 | 37 | 34.9 |
| Reference year | 106 | 85 | 21 | 19.8 |
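The 10% rule and the summary columns can be sketched as follows. The helper name `drop_sparse_columns()` and the toy data are assumptions for illustration; the notebook's own implementation may differ.

```r
# Keep only columns whose share of missing values is at most the threshold
drop_sparse_columns <- function(df, max_pct_missing = 10) {
  pct_missing <- 100 * colMeans(is.na(df))
  df[, pct_missing <= max_pct_missing, drop = FALSE]
}

# Made-up toy data: 20 rows, one sparse column
toy <- data.frame(
  Year = rep(2022, 20),
  A = c(NA, seq_len(19)),         # 5% missing, kept
  B = c(rep(NA, 3), seq_len(17))  # 15% missing, dropped
)

cleaned <- drop_sparse_columns(toy)

# One row of a before/after summary
n_dropped <- ncol(toy) - ncol(cleaned)
summary_row <- data.frame(
  Dataset = "Toy",
  ColumnsBefore = ncol(toy),
  ColumnsAfter = ncol(cleaned),
  ColumnsDropped = n_dropped,
  PctDropped = round(100 * n_dropped / ncol(toy), 1)
)
print(summary_row)
```

Applying the function separately to each of the three datasets explains why the number of surviving columns differs between them: a column can be nearly complete in the reference year yet mostly empty in the base year.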
The next part of the process is slightly trickier, as we have
many different columns of a very different nature. Here,
we will be using the missForest package to impute all
missing values in all variables. This is not done on a per-country basis,
as certain countries have no data for some fields, making it impossible
to create meaningful imputations. By doing the imputation at the
global level, we’re able to benefit from cross-country patterns
and from inter-dependencies between variables.
The technique used here is based on a series of automatically defined and fitted random forest (RF) models (each model is cross-validated 5 times). The advantages of this method are that it makes no assumptions about the distribution of the observations and that we can easily obtain summary statistics on how accurate the imputed values are likely to be.
To further improve the reliability of the imputation, we create what is essentially a series of 10 datasets with different imputed values. By doing so, we account for the randomness of the missing observations as we can later on combine these datasets into one and run our subsequent analysis on all of them. This will increase the number of observations but in a proportional manner, so the original values will matter in exactly the same way as they did before the imputation was performed. Please note that these duplicates are dealt with further down the line.
To prevent errors from affecting our imputation, we remove countries which have fewer than 5 observations in the all-time historical data (otherwise, the RF models will not be able to produce any results). An overview of the countries we’re removing from our analysis is printed below:
| Country | CountryCode | NumberOfRows |
|---|---|---|
| Angola | AGO | 4 |
| Belize | BLZ | 2 |
| Bhutan | BTN | 3 |
| Congo | COG | 1 |
| Cuba | CUB | 1 |
| Djibouti | DJI | 4 |
| Eswatini | SWZ | 3 |
| Eswatini, Kingdom of | SWZ | 1 |
| Gambia | GMB | 4 |
| Guyana | GUY | 1 |
| Maldives | MDV | 1 |
| Oman | OMN | 1 |
| Somalia | SOM | 3 |
| South Sudan | SSD | 4 |
| Suriname | SUR | 1 |
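The filtering step can be sketched in base R as shown below. The toy data is made up (only Belize's two rows mirror the table above), and filtering by `CountryCode` is an assumption about how the removal is keyed.

```r
# Made-up toy history: one well-covered and one sparsely covered country
toy <- data.frame(
  Country = c(rep("Finland", 6), rep("Belize", 2)),
  CountryCode = c(rep("FIN", 6), rep("BLZ", 2)),
  Year = c(2017:2022, 2021:2022)
)

# Count rows per country
rows_per_country <- aggregate(Year ~ Country + CountryCode, data = toy, FUN = length)
names(rows_per_country)[3] <- "NumberOfRows"

# Countries with fewer than 5 observations are removed before imputation
min_rows <- 5
to_remove <- rows_per_country[rows_per_country$NumberOfRows < min_rows, ]
filtered <- toy[!toy$CountryCode %in% to_remove$CountryCode, ]

print(to_remove)
```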
Please note that the process of imputing data via RF models may take some time to complete, as it can be computationally intensive. Errors may be generated in some cases because of odd-looking data within some countries/indicators (e.g. not enough predictors/observations to generate RF models).
To undo the artificial increase in the number of observations caused by the random forest imputation technique, we group all rows by year and country and then use the mean value of each indicator. If any missing values remain after this step, we group the rows by year only and use that average instead (this second step concerns fewer than 70 rows, or less than 3.5% of all rows, and applies to entities not universally recognized as independent countries, such as Hong Kong and the Palestinian Authority).
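The two-step averaging can be sketched as follows, using a single made-up indicator `X` and hypothetical country codes. Each country-year appears twice here, standing in for the stacked imputed datasets.

```r
# Made-up stacked data: two copies per country-year, BBB has no values at all
stacked <- data.frame(
  CountryCode = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC"),
  Year = 2022,
  X = c(1, 3, NA, NA, 5, 7)
)

# Mean that returns NA (instead of NaN) when a group has no observed values
mean_or_na <- function(v) if (all(is.na(v))) NA_real_ else mean(v, na.rm = TRUE)

# Step 1: collapse the copies to one row per country and year
pooled <- aggregate(X ~ CountryCode + Year, data = stacked,
                    FUN = mean_or_na, na.action = na.pass)

# Step 2: fill any remaining gaps with the average across countries for that year
year_mean <- ave(pooled$X, pooled$Year, FUN = function(v) mean(v, na.rm = TRUE))
pooled$X <- ifelse(is.na(pooled$X), year_mean, pooled$X)

print(pooled)
```

Because every original row is duplicated the same number of times, the group mean of an observed value equals the value itself, so only imputed cells are actually averaged.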
As a final check, we once again see whether we have any missing values after this step of the process:
Before continuing, we check whether we have any missing countries in the dataset containing the imputed observations:
## [1] "No countries were found to be missing in the dataset containing the imputed values."
Furthermore, we check whether there are any persisting missing values even after the imputation (this shouldn’t be the case):
## [1] "There are no missing values in the data frame containing the imputations."
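Both checks can be expressed in a few lines. The sketch below uses made-up toy data and assumes countries are compared by `CountryCode`; the messages mirror the output shown above.

```r
# Toy stand-ins for the pre-imputation and post-imputation data
original <- data.frame(CountryCode = c("FIN", "DNK", "ISL"))
imputed <- data.frame(
  CountryCode = c("FIN", "DNK", "ISL"),
  X = c(1.0, 2.0, 3.0)  # made-up indicator values
)

# Check 1: countries present before imputation but absent afterwards
missing_countries <- setdiff(unique(original$CountryCode), unique(imputed$CountryCode))
if (length(missing_countries) == 0) {
  print("No countries were found to be missing in the dataset containing the imputed values.")
} else {
  print(missing_countries)
}

# Check 2: total number of NA cells left after imputation
n_missing <- sum(is.na(imputed))
if (n_missing == 0) {
  print("There are no missing values in the data frame containing the imputations.")
} else {
  print(paste("Remaining missing values:", n_missing))
}
```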
We applied the imputation to the all-time historical data, so we now need to split the result into separate datasets for the base and the reference years. When doing so, we stay true to our original methodology, in which we completely excluded columns that had more than 10% missing data.
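One way to do this split is sketched below: filter the imputed data by year, then keep only the columns that passed the 10% rule for that particular dataset. The column lists `cols_base` and `cols_ref` and the values are hypothetical placeholders.

```r
BaseYear <- 2005
ReferenceYear <- 2022

# Made-up imputed data with two indicators
ImputedData <- data.frame(
  CountryCode = c("AAA", "AAA"),
  Year = c(2005, 2022),
  A = c(1.0, 2.0),
  B = c(3.0, 4.0)
)

# Hypothetical lists of columns that passed the 10% rule in each dataset
cols_base <- c("CountryCode", "Year", "A")      # B was too sparse in the base year
cols_ref <- c("CountryCode", "Year", "A", "B")

# Filter rows by year, then restrict to the columns retained for that year
ImputedBaseYear <- ImputedData[ImputedData$Year == BaseYear,
                               intersect(names(ImputedData), cols_base), drop = FALSE]
ImputedReferenceYear <- ImputedData[ImputedData$Year == ReferenceYear,
                                    intersect(names(ImputedData), cols_ref), drop = FALSE]

print(names(ImputedBaseYear))
print(names(ImputedReferenceYear))
```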
After these adjustments are applied, we export all three datasets so that the analysis can continue in another notebook. With this, we’re finally ready to start modelling the data!